1. The Deep Learning Revolution

Example of how to learn a function that fits synthetic data

We take \(N\) input points \(x_n\) spaced uniformly on an interval \([a,b]\). For each point we compute \(\sin{(2\pi x_n)}\) and add normally distributed random noise to obtain the target value \(t_n\).
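As a minimal sketch of the data generation described above, the snippet below uses NumPy. The function name `make_sine_data`, the noise standard deviation of 0.3, the interval \([0, 1]\) and the seed are all assumptions chosen for illustration; the notes do not fix them.

```python
import numpy as np

def make_sine_data(n_points, noise_std=0.3, a=0.0, b=1.0, seed=0):
    """Return inputs x_n spaced uniformly on [a, b] and noisy targets t_n."""
    rng = np.random.default_rng(seed)            # seeded for reproducibility
    x = np.linspace(a, b, n_points)              # uniformly spaced inputs
    noise = rng.normal(0.0, noise_std, size=n_points)
    t = np.sin(2 * np.pi * x) + noise            # targets = sine + Gaussian noise
    return x, t

x_train, t_train = make_sine_data(10)
```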

Our objective will be to predict the target value \(\hat{t}\) for a new value \(\hat{x}\) of the input variable.

We consider the most basic way of fitting a function to these points: polynomial curve fitting. (Strictly speaking this is not interpolation, since the fitted curve is not required to pass through every point.)

The fitted function has the following form:

$$ y(x, \textbf{w}) = w_0 + w_1x + w_2x^2 + \dots + w_Mx^M = \sum_{j = 0}^M w_jx^j$$ where \(M\) is the degree (order) of the polynomial and the coefficients \(w_0, \dots, w_M\) are collected in a vector \(\textbf{w}\). Although \(y(x, \textbf{w})\) is a nonlinear function of \(x\), it is a linear function of the coefficients \(\textbf{w}\). Functions that are linear in the unknown parameters are called linear models.

To learn the weights, we will try to minimize an error (loss or objective) function. A common one is:

\[ E(\textbf{w}) = \frac{1}{2}\sum_{n=1}^N(y(x_n, \textbf{w}) - t_n)^2 \]

When we minimize \(E\), we will get an optimal value for the weights that we will call \(\textbf{w}^*\). Our polynomial will then be \(y(x, \textbf{w}^*)\).
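Because \(y\) is linear in \(\textbf{w}\), the error \(E(\textbf{w})\) is quadratic in \(\textbf{w}\) and its minimizer can be found by linear least squares. The sketch below is one possible way to do this with NumPy's `np.vander` and `np.linalg.lstsq`; the helper names `design_matrix`, `fit_polynomial` and `predict` are made up for illustration.

```python
import numpy as np

def design_matrix(x, M):
    # Column j holds x**j for j = 0..M, so that y = Phi @ w
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    # Least-squares solution w* minimizing 0.5 * sum((Phi @ w - t)**2)
    Phi = design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict(x, w):
    # Evaluate y(x, w) for a coefficient vector of length M + 1
    return design_matrix(x, len(w) - 1) @ w

# Sanity check on noise-free quadratic data: the coefficients are recovered
x = np.linspace(0.0, 1.0, 20)
t = 1.0 + 2.0 * x - 3.0 * x**2
w_star = fit_polynomial(x, t, M=2)   # approximately [1, 2, -3]
```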

We then need to choose the order of our polynomial. To do this, we fit polynomials with different \(M\) and see which one fits our data best. As we increase \(M\), the polynomial fits the training points more and more closely, eventually perfectly, but it no longer resembles the function \(\sin{(2\pi x)}\) that generated the data.

![[Screenshot 2025-09-10 at 3.44.12 PM.png]]

To detect this, we test the model's ability to generalize by evaluating it on a test set made up of data points that were not used for training. We then calculate the error on both the training set and the test set.

Instead of \(E(\textbf{w})\), it is sometimes more convenient to use the root-mean-square (RMS) error, which is defined as:

\[ E_{RMS} = \sqrt{\frac{1}{N}\sum_{n=1}^N(y(x_n,\textbf{w}) - t_n)^2} \]

This new error function has the following advantages:

1. The division by \(N\) allows us to compare datasets with different sizes.
2. The square root ensures that our error is in the same units as our target variable \(t\).
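The two error functions can be compared with a small sketch. The function names and the toy numbers are illustrative only. Note that \(E_{RMS} = \sqrt{2E(\textbf{w})/N}\), and that repeating every data point doubles \(E\) but leaves \(E_{RMS}\) unchanged.

```python
import numpy as np

def sum_of_squares_error(y, t):
    # E(w) = 0.5 * sum_n (y_n - t_n)^2
    return 0.5 * np.sum((y - t) ** 2)

def rms_error(y, t):
    # E_RMS = sqrt( (1/N) * sum_n (y_n - t_n)^2 )
    return np.sqrt(np.mean((y - t) ** 2))

y = np.array([1.0, 2.0, 3.0])   # toy predictions
t = np.array([1.0, 2.0, 5.0])   # toy targets

E = sum_of_squares_error(y, t)  # 0.5 * 4 = 2.0
E_rms = rms_error(y, t)         # sqrt(4 / 3)

# Duplicating the data set doubles E but does not change E_RMS
E_doubled = sum_of_squares_error(np.tile(y, 2), np.tile(t, 2))
E_rms_doubled = rms_error(np.tile(y, 2), np.tile(t, 2))
```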

If we compare the RMS error for different \(M\) on the training and test sets, we notice the following trend:

![[Screenshot 2025-09-10 at 3.50.50 PM.png]]

As \(M\) increases, the error on the training set decreases and reaches 0 at \(M = 9\). The error on the test set behaves differently: it first falls to a minimum at \(M = 3\) but then rises again as \(M\) increases further.

What is happening here?

The model is overfitting to the training data. A polynomial of degree 9 has 10 coefficients, and we know from mathematics that there is a unique polynomial of degree at most 9 passing through any 10 points with distinct \(x\) values. This is why the training error is exactly \(0\) at \(M = 9\). However, the resulting function swings wildly between the training points, so it generalizes poorly to the overall trend. This explains the increase in error on the test set.
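The effect can be reproduced with a short experiment. This is only a sketch under assumed settings (10 training points, 100 test points, noise standard deviation 0.3, seed 0), so the exact numbers will differ from the figures in these notes, but the training error at \(M = 9\) should come out numerically zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def make_data(n, noise_std=0.3):
    # Uniformly spaced inputs on [0, 1] with noisy sine targets
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t

def rms(x, t, w):
    # RMS error of the polynomial with coefficients w on the set (x, t)
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

train_rms, test_rms = [], []
for M in range(10):
    # Least-squares fit of a degree-M polynomial to the training set
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    train_rms.append(rms(x_train, t_train, w))
    test_rms.append(rms(x_test, t_test, w))
    print(f"M={M}  train={train_rms[-1]:.3f}  test={test_rms[-1]:.3f}")
```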

Additionally, for a given model complexity, overfitting becomes less severe as the size of the training set increases, and the model generalizes much better.

![[Screenshot 2025-09-10 at 5.04.52 PM.png]]

A common heuristic says that the number of data points should be no less than some multiple (say 5 or 10) of the number of learnable parameters in the model.

That rule of thumb comes from classical statistics and does not carry over to deep learning, where models often have many more learnable parameters than there are training data points.

Regularization

Regularization helps control overfitting without having to limit the number of parameters. It usually involves adding a penalty term to the error function to discourage the coefficients from having large magnitudes.

The simplest and most common penalty is the sum of the squares of all the coefficients. It gives a modified error function:

$$ \tilde{E}(\textbf{w}) = \frac{1}{2}\sum_{n = 1}^N(y(x_n, \textbf{w}) - t_n)^2 + \frac{\lambda}{2}||\textbf{w}||^2 $$ where the coefficient \(\lambda\) determines the relative importance of the regularization term and \(||\textbf{w}||^2=\textbf{w}^{\text{T}}\textbf{w}\).

![[Screenshot 2025-09-11 at 5.30.39 PM.png]]

We see that a small but nonzero value of \(\lambda\) suppresses the overfitting and gives a much closer fit to the actual function, whereas a value of \(\lambda\) that is too large gives a poor fit again. When plotting the RMS error for the training and test sets against \(\ln{\lambda}\), we see that \(\lambda\) effectively controls the complexity of the model and hence the degree of overfitting.
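Setting the gradient of \(\tilde{E}\) to zero gives the linear system \((\Phi^{\text{T}}\Phi + \lambda I)\textbf{w} = \Phi^{\text{T}}\textbf{t}\), where \(\Phi\) is the matrix with entries \(\Phi_{nj} = x_n^j\). The sketch below solves it directly. The name `fit_ridge`, the data settings and the chosen \(\lambda\) values are assumptions for illustration, and every coefficient (including \(w_0\)) is penalized to match the formula above.

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    # Minimizer of 0.5 * sum((Phi @ w - t)**2) + 0.5 * lam * ||w||^2
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)                      # arbitrary seed
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=10)

# Larger lambda shrinks the coefficients of the M = 9 polynomial
for lam in (1e-8, 1e-3, 1.0):
    w = fit_ridge(x, t, M=9, lam=lam)
    print(f"lambda={lam:g}  ||w||={np.linalg.norm(w):.2f}")
```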

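![[Screenshot 2025-09-11 at 5.32.27 PM.png]]

\(\lambda\) is known as a hyperparameter: it is held fixed while the error function is minimized to determine the model's parameters \(\textbf{w}\). We cannot choose \(\lambda\) by simply minimizing the training error with respect to both \(\textbf{w}\) and \(\lambda\), since this would drive \(\lambda \to 0\) and bring back the overfitted model.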

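The usual procedure for choosing hyperparameters such as \(\lambda\) or \(M\) is to split the available data into a training set and a validation set:

1. Determine the weights on the training set.
2. Evaluate each candidate model on the validation set.
3. Choose the model that has the lowest error on the validation set.
4. If the model design is iterated many times on a limited data set, some overfitting to the validation set can occur, so keep aside a third test set on which the finally selected model is evaluated.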

If data availability is a major constraint, we will want to use as much as possible for the training of our model. If the validation set is too small, we will get a relatively poor estimate of our predictive performance. A solution is a technique called cross-validation.

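In \(S\)-fold cross-validation, the data is partitioned into \(S\) groups. Each run trains on \(S-1\) of the groups, i.e. a proportion \((S-1)/S\) of the available data, and evaluates on the remaining group. Repeating this for all \(S\) choices of held-out group and averaging the scores means that all of the data is used to assess performance. If we set \(S = N\) (with \(N\) being the size of the whole dataset), we get the leave-one-out technique.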
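A minimal sketch of this procedure for the polynomial model is shown below. The function name `cross_validate`, the random shuffling of the data and the use of the RMS error as the score are assumptions made for the example.

```python
import numpy as np

def cross_validate(x, t, M, S, seed=0):
    """Average held-out RMS error of a degree-M polynomial over S folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), S)  # S disjoint index groups
    errors = []
    for held_out in folds:
        # Train on the other S - 1 groups
        train = np.setdiff1d(np.arange(len(x)), held_out)
        Phi_train = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_train, t[train], rcond=None)
        # Evaluate on the held-out group
        Phi_val = np.vander(x[held_out], M + 1, increasing=True)
        errors.append(np.sqrt(np.mean((Phi_val @ w - t[held_out]) ** 2)))
    return float(np.mean(errors))

# Noise-free cubic data: a cubic model should have (numerically) zero CV error
x = np.linspace(0.0, 1.0, 12)
t = x**3 - x
cv_cubic = cross_validate(x, t, M=3, S=4)
cv_linear = cross_validate(x, t, M=1, S=4)
cv_loo = cross_validate(x, t, M=3, S=len(x))   # S = N gives leave-one-out
```

Each call fits the model \(S\) times, which is where the extra computational cost of cross-validation comes from.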

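The main drawback of cross-validation is that the number of training runs increases by a factor of \(S\), which is a problem when training is itself computationally expensive. In addition, with several hyperparameters, exploring combinations of their settings can require a number of training runs that grows exponentially with the number of hyperparameters.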

Brief History of Machine Learning

The Neuron

A single neuron can be described mathematically by the following functions:

$$ a = \sum_{i = 1}^M w_ix_i $$ $$ y = f(a) $$ where \(x_1, \dots, x_M\) represent \(M\) inputs corresponding to activities of other neurons that send connections to this neuron and \(w_1, \dots, w_M\) are continuous variables called weights. The quantity \(a\) is the pre-activation, the nonlinear function \(f(\cdot)\) is the activation function and \(y\) is called the activation.
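In code, a single neuron is just a weighted sum followed by a nonlinearity. The sketch below uses `tanh` as the activation function purely as an example; the notes do not specify a particular \(f\), and the input and weight values are made up.

```python
import numpy as np

def neuron(x, w, f=np.tanh):
    a = np.dot(w, x)   # pre-activation: a = sum_i w_i * x_i
    return f(a)        # activation: y = f(a)

x = np.array([1.0, 2.0])     # inputs from two other neurons
w = np.array([0.5, 0.25])    # weights, so a = 0.5 * 1 + 0.25 * 2 = 1.0
y = neuron(x, w)             # tanh(1.0)
```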

Single-Layer Networks

The perceptron is a type of single-layer neural network whose activation function is a step function:

$$ f(a) = \begin{cases} 0, & \text{if } a \leq 0, \\ 1, & \text{if } a > 0 \end{cases}$$ The perceptron convergence theorem guarantees that if there exists a set of weight values for which the perceptron classifies its training data perfectly, then the perceptron learning algorithm will find such a solution in a finite number of steps. The perceptron was developed by Rosenblatt (1962).
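The notes do not spell out the learning rule, so the sketch below assumes the standard perceptron update for targets in \(\{0, 1\}\): whenever a point is misclassified, the weights are moved by \(\eta\,(t_n - y_n)\,\textbf{x}_n\). The bias is handled by appending a constant input of 1, and the logical AND data set is just a convenient linearly separable example.

```python
import numpy as np

def add_bias(X):
    # Append a constant input of 1 so the last weight acts as a bias
    return np.hstack([X, np.ones((len(X), 1))])

def predict(X, w):
    # Step activation: 1 if a > 0, otherwise 0
    return (add_bias(X) @ w > 0).astype(int)

def train_perceptron(X, t, lr=1.0, max_epochs=100):
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(Xb, t):
            y_n = int(x_n @ w > 0)
            if y_n != t_n:
                w += lr * (t_n - y_n) * x_n   # update only on mistakes
                mistakes += 1
        if mistakes == 0:                      # perfect classification: stop
            break
    return w

# Logical AND is linearly separable, so the algorithm converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
```

On data that is not linearly separable the loop simply stops after `max_epochs` without reaching zero mistakes.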

Perceptrons were limited by the lack of an effective algorithm for training networks with more than one layer of learnable parameters.

Backpropagation

The solution to the training problem was to use gradient-based optimization methods. This meant replacing the step function with continuous, differentiable activation functions that have non-zero gradients. It also meant introducing differentiable error functions that measure how well a given choice of parameters predicts the targets on the training set.

![[Screenshot 2025-09-11 at 7.56.05 PM.png]] This is an example of a feed-forward neural network.

To train this network, the parameters are first initialized randomly and are then iteratively updated using gradient-based optimization techniques. The derivatives of the error function are calculated efficiently using error backpropagation, in which information flows backwards through the network from the outputs towards the inputs. The best known of these optimization techniques is stochastic gradient descent.
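To make the training loop concrete, here is a small sketch of backpropagation and stochastic gradient descent for a network with one hidden layer. The architecture (8 `tanh` hidden units, a linear output), the squared error on a single data point, the learning rate and the noise-free sine data are all assumptions for the example, not details taken from the figure above.

```python
import numpy as np

def init_params(n_hidden, rng):
    # Random initial weights, zero biases
    return {"w1": rng.normal(size=n_hidden), "b1": np.zeros(n_hidden),
            "w2": rng.normal(size=n_hidden), "b2": 0.0}

def forward(p, x):
    z = np.tanh(p["w1"] * x + p["b1"])   # hidden activations
    y = p["w2"] @ z + p["b2"]            # linear output unit
    return y, z

def loss(p, x, t):
    y, _ = forward(p, x)
    return 0.5 * (y - t) ** 2            # squared error on one data point

def backprop(p, x, t):
    y, z = forward(p, x)
    delta_out = y - t                                    # dE/dy
    delta_hidden = (1.0 - z ** 2) * p["w2"] * delta_out  # passed back through tanh
    return {"w1": delta_hidden * x, "b1": delta_hidden,
            "w2": delta_out * z, "b2": delta_out}

def sgd(p, x_data, t_data, lr=0.05, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    p = dict(p)                          # values are replaced, not mutated
    for _ in range(n_steps):
        n = rng.integers(len(x_data))    # pick one training point at random
        g = backprop(p, x_data[n], t_data[n])
        for k in p:
            p[k] = p[k] - lr * g[k]      # gradient descent step
    return p

x_data = np.linspace(0.0, 1.0, 20)
t_data = np.sin(2 * np.pi * x_data)
p0 = init_params(8, np.random.default_rng(1))
p_trained = sgd(p0, x_data, t_data)
```

A useful habit is to compare the gradients from `backprop` against finite differences of `loss` before trusting a hand-written backward pass.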

Prior knowledge, in the form of inductive biases built into the model, can be used to help it learn more effectively from limited data.

In deep neural networks, the hidden layers can perform representation learning: they learn to transform the input data into a new representation that is semantically richer and makes the problem easier for the final layer or layers to solve. This is what allows deep neural networks to be used for transfer learning. Large models that can be fine-tuned for a range of downstream tasks are called foundation models.